## [1] 113937     81

The raw data contains 81 variables, which is too many for this project. The variables listed below should be more useful for analysing the loan data.

Here are the details of the new, stripped-down data set:

## [1] 113937     19
##  [1] "Term"                       "LoanStatus"                
##  [3] "ClosedDate"                 "ListingCategory..numeric." 
##  [5] "BorrowerState"              "Occupation"                
##  [7] "IncomeRange"                "IncomeVerifiable"          
##  [9] "StatedMonthlyIncome"        "CreditScoreRangeLower"     
## [11] "ProsperScore"               "EmploymentStatus"          
## [13] "EmploymentStatusDuration"   "CurrentCreditLines"        
## [15] "TotalCreditLinespast7years" "DebtToIncomeRatio"         
## [17] "BorrowerRate"               "LoanOriginalAmount"        
## [19] "LoanOriginationDate"
## 'data.frame':    113937 obs. of  19 variables:
##  $ Term                      : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ ListingCategory..numeric. : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState             : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ IncomeRange               : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable          : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome       : num  3083 6125 2083 2875 9583 ...
##  $ CreditScoreRangeLower     : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ ProsperScore              : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ EmploymentStatus          : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration  : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ CurrentCreditLines        : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ TotalCreditLinespast7years: int  12 29 3 29 49 49 20 10 32 32 ...
##  $ DebtToIncomeRatio         : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ BorrowerRate              : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LoanOriginalAmount        : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate       : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (>120 days)"   "Past Due (1-15 days)"  
##  [9] "Past Due (16-30 days)"  "Past Due (31-60 days)" 
## [11] "Past Due (61-90 days)"  "Past Due (91-120 days)"
##  [1] ""                                  
##  [2] "Accountant/CPA"                    
##  [3] "Administrative Assistant"          
##  [4] "Analyst"                           
##  [5] "Architect"                         
##  [6] "Attorney"                          
##  [7] "Biologist"                         
##  [8] "Bus Driver"                        
##  [9] "Car Dealer"                        
## [10] "Chemist"                           
## [11] "Civil Service"                     
## [12] "Clergy"                            
## [13] "Clerical"                          
## [14] "Computer Programmer"               
## [15] "Construction"                      
## [16] "Dentist"                           
## [17] "Doctor"                            
## [18] "Engineer - Chemical"               
## [19] "Engineer - Electrical"             
## [20] "Engineer - Mechanical"             
## [21] "Executive"                         
## [22] "Fireman"                           
## [23] "Flight Attendant"                  
## [24] "Food Service"                      
## [25] "Food Service Management"           
## [26] "Homemaker"                         
## [27] "Investor"                          
## [28] "Judge"                             
## [29] "Laborer"                           
## [30] "Landscaping"                       
## [31] "Medical Technician"                
## [32] "Military Enlisted"                 
## [33] "Military Officer"                  
## [34] "Nurse (LPN)"                       
## [35] "Nurse (RN)"                        
## [36] "Nurse's Aide"                      
## [37] "Other"                             
## [38] "Pharmacist"                        
## [39] "Pilot - Private/Commercial"        
## [40] "Police Officer/Correction Officer" 
## [41] "Postal Service"                    
## [42] "Principal"                         
## [43] "Professional"                      
## [44] "Professor"                         
## [45] "Psychologist"                      
## [46] "Realtor"                           
## [47] "Religious"                         
## [48] "Retail Management"                 
## [49] "Sales - Commission"                
## [50] "Sales - Retail"                    
## [51] "Scientist"                         
## [52] "Skilled Labor"                     
## [53] "Social Worker"                     
## [54] "Student - College Freshman"        
## [55] "Student - College Graduate Student"
## [56] "Student - College Junior"          
## [57] "Student - College Senior"          
## [58] "Student - College Sophomore"       
## [59] "Student - Community College"       
## [60] "Student - Technical School"        
## [61] "Teacher"                           
## [62] "Teacher's Aide"                    
## [63] "Tradesman - Carpenter"             
## [64] "Tradesman - Electrician"           
## [65] "Tradesman - Mechanic"              
## [66] "Tradesman - Plumber"               
## [67] "Truck Driver"                      
## [68] "Waiter/Waitress"
## [1] "$0"             "$1-24,999"      "$100,000+"      "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed"  "Not employed"
## [1] ""              "Employed"      "Full-time"     "Not available"
## [5] "Not employed"  "Other"         "Part-time"     "Retired"      
## [9] "Self-employed"
##       Term                       LoanStatus                  ClosedDate   
##  Min.   :12.00   Current              :56576                      :58848  
##  1st Qu.:36.00   Completed            :38074   2014-03-04 00:00:00:  105  
##  Median :36.00   Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  Mean   :40.83   Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  3rd Qu.:36.00   Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  Max.   :60.00   Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##                  (Other)              : 1108   (Other)            :54633  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation            IncomeRange    IncomeVerifiable
##  Other                   :28617   $25,000-49,999:32192   False:  8669    
##  Professional            :13628   $50,000-74,999:31050   True :105268    
##  Computer Programmer     : 4478   $100,000+     :17337                   
##  Executive               : 4311   $75,000-99,999:16916                   
##  Teacher                 : 3759   Not displayed : 7741                   
##  Administrative Assistant: 3688   $1-24,999     : 7274                   
##  (Other)                 :55456   (Other)       : 1427                   
##  StatedMonthlyIncome CreditScoreRangeLower  ProsperScore  
##  Min.   :      0     Min.   :  0.0         Min.   : 1.00  
##  1st Qu.:   3200     1st Qu.:660.0         1st Qu.: 4.00  
##  Median :   4667     Median :680.0         Median : 6.00  
##  Mean   :   5608     Mean   :685.6         Mean   : 5.95  
##  3rd Qu.:   6825     3rd Qu.:720.0         3rd Qu.: 8.00  
##  Max.   :1750003     Max.   :880.0         Max.   :11.00  
##                      NA's   :591           NA's   :29084  
##       EmploymentStatus EmploymentStatusDuration CurrentCreditLines
##  Employed     :67322   Min.   :  0.00           Min.   : 0.00     
##  Full-time    :26355   1st Qu.: 26.00           1st Qu.: 7.00     
##  Self-employed: 6134   Median : 67.00           Median :10.00     
##  Not available: 5347   Mean   : 96.07           Mean   :10.32     
##  Other        : 3806   3rd Qu.:137.00           3rd Qu.:13.00     
##               : 2255   Max.   :755.00           Max.   :59.00     
##  (Other)      : 2718   NA's   :7625             NA's   :7604      
##  TotalCreditLinespast7years DebtToIncomeRatio  BorrowerRate   
##  Min.   :  2.00             Min.   : 0.000    Min.   :0.0000  
##  1st Qu.: 17.00             1st Qu.: 0.140    1st Qu.:0.1340  
##  Median : 25.00             Median : 0.220    Median :0.1840  
##  Mean   : 26.75             Mean   : 0.276    Mean   :0.1928  
##  3rd Qu.: 35.00             3rd Qu.: 0.320    3rd Qu.:0.2500  
##  Max.   :136.00             Max.   :10.010    Max.   :0.4975  
##  NA's   :697                NA's   :8554                      
##  LoanOriginalAmount          LoanOriginationDate
##  Min.   : 1000      2014-01-22 00:00:00:   491  
##  1st Qu.: 4000      2013-11-13 00:00:00:   490  
##  Median : 6500      2014-02-19 00:00:00:   439  
##  Mean   : 8337      2013-10-16 00:00:00:   434  
##  3rd Qu.:12000      2014-01-28 00:00:00:   339  
##  Max.   :35000      2013-09-24 00:00:00:   316  
##                     (Other)            :111428
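The stripped-down data set summarized above could be produced along the following lines. This is only a sketch: the toy data frame, the `DroppedColumn` name and the short `keepVars` vector are assumptions for illustration (the real analysis keeps 19 of the 81 columns), and the author's actual subsetting code is not shown in this report.

```r
# Toy stand-in for the full data (hypothetical column "DroppedColumn";
# the other values are the first rows shown by str() above).
fullData <- data.frame(
  DroppedColumn      = c("a1", "b2", "c3"),
  Term               = c(36L, 36L, 36L),
  LoanStatus         = c("Completed", "Current", "Completed"),
  BorrowerRate       = c(0.158, 0.092, 0.275),
  LoanOriginalAmount = c(9425L, 10000L, 3001L)
)

# Names of the columns to keep (shortened list for the example).
keepVars <- c("Term", "LoanStatus", "BorrowerRate", "LoanOriginalAmount")

# Fail early if a column name is misspelled.
stopifnot(all(keepVars %in% names(fullData)))

# Keep only the selected columns, in the order given.
loanData <- fullData[, keepVars]

dim(loanData)
names(loanData)
```

Selecting by name rather than by position keeps the subset stable if the column order of the source file changes.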

Summary

## 
##    12    36    60 
##  1614 87778 24545

Most loans have a 36-month term (87778 entries), and there are far fewer 12-month loans than 36- or 60-month loans. Surprisingly, there are no 48-month loans at all; borrowers chose terms of 1, 3 or 5 years only.

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

There are 11992 charged-off and 5018 defaulted loans, so about 15% of all loans have already defaulted or are very likely to. This seems a high number.
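The share of charged-off and defaulted loans can be checked directly from the counts in the table above. The sketch below only redoes that arithmetic; the variable names are made up for illustration.

```r
# Counts copied from the LoanStatus table above.
statusCounts <- c(Chargedoff = 11992, Defaulted = 5018)
totalLoans   <- 113937   # number of rows in the data set

# Share of all loans that are charged off or defaulted.
badShare <- sum(statusCounts) / totalLoans

# Express as a percentage with one decimal place.
round(100 * badShare, 1)
```

Note that the denominator includes the 56576 loans that are still current; restricting it to finished loans would give a noticeably higher share.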

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 16965 58308  7433  7189  2395   756  2572 10494   199    85    91   217 
##    12    13    14    15    16    17    18    19    20 
##    59  1996   876  1522   304    52   885   768   771

More than half of the loans are in the debt consolidation category. Excluding the Not Available and Other categories, the next highest counts are for Home Improvement and Business.

## 
##     1     2     3     4     5     6     7     8     9    10    11 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456

The histogram looks roughly like a bell curve: most scores lie between 4 and 8, while few loans have a score below 2 or above 10.

## 
##                                                        Accountant/CPA 
##                               3588                               3233 
##           Administrative Assistant                            Analyst 
##                               3688                               3602 
##                          Architect                           Attorney 
##                                213                               1046 
##                          Biologist                         Bus Driver 
##                                125                                316 
##                         Car Dealer                            Chemist 
##                                180                                145 
##                      Civil Service                             Clergy 
##                               1457                                196 
##                           Clerical                Computer Programmer 
##                               3164                               4478 
##                       Construction                            Dentist 
##                               1790                                 68 
##                             Doctor                Engineer - Chemical 
##                                494                                225 
##              Engineer - Electrical              Engineer - Mechanical 
##                               1125                               1406 
##                          Executive                            Fireman 
##                               4311                                422 
##                   Flight Attendant                       Food Service 
##                                123                               1123 
##            Food Service Management                          Homemaker 
##                               1239                                120 
##                           Investor                              Judge 
##                                214                                 22 
##                            Laborer                        Landscaping 
##                               1595                                236 
##                 Medical Technician                  Military Enlisted 
##                               1117                               1272 
##                   Military Officer                        Nurse (LPN) 
##                                346                                492 
##                         Nurse (RN)                       Nurse's Aide 
##                               2489                                491 
##                              Other                         Pharmacist 
##                              28617                                257 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                199                               1578 
##                     Postal Service                          Principal 
##                                627                                312 
##                       Professional                          Professor 
##                              13628                                557 
##                       Psychologist                            Realtor 
##                                145                                543 
##                          Religious                  Retail Management 
##                                124                               2602 
##                 Sales - Commission                     Sales - Retail 
##                               3446                               2797 
##                          Scientist                      Skilled Labor 
##                                372                               2746 
##                      Social Worker         Student - College Freshman 
##                                741                                 41 
## Student - College Graduate Student           Student - College Junior 
##                                245                                112 
##           Student - College Senior        Student - College Sophomore 
##                                188                                 69 
##        Student - Community College         Student - Technical School 
##                                 28                                 16 
##                            Teacher                     Teacher's Aide 
##                               3759                                276 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                120                                477 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                951                                102 
##                       Truck Driver                    Waiter/Waitress 
##                               1675                                436

Many professions are listed here. I removed the dominant Other and Professional entries, but the labels on the x axis are still unreadable, so I flip the axes to show the full occupation names on the y axis.

Among the remaining occupations, Computer Programmer is the most common.
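A flipped bar chart of this kind might be drawn as follows. This is a sketch assuming ggplot2; it uses only the six largest counts from the occupation table above, and the object names `occ` and `occSpecific` are invented for the example.

```r
library(ggplot2)

# Counts copied from the occupation table above (six largest entries only).
occ <- data.frame(
  Occupation = c("Other", "Professional", "Computer Programmer",
                 "Executive", "Teacher", "Administrative Assistant"),
  Count = c(28617, 13628, 4478, 4311, 3759, 3688)
)

# Drop the two catch-all categories that dominate the plot.
occSpecific <- subset(occ, !(Occupation %in% c("Other", "Professional")))

# coord_flip() swaps the axes, so the occupation names end up on the
# y axis where they can be read in full. reorder() sorts bars by count.
p <- ggplot(occSpecific, aes(x = reorder(Occupation, Count), y = Count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Occupation", y = "Number of loans")

print(p)
```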

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 591 rows containing non-finite values (stat_bin).

There are a few outliers with a credit score of 0, which need to be removed.

## 
##     0   360   420   440   460   480   500   520   540   560   580   600 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   620   640   660   680   700   720   740   760   780   800   820   840 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   860   880 
##   212    27

Most people are in the 650 - 750 range. It would be interesting to compare defaults against credit scores; people with lower scores should have a higher share of defaulted or charged-off loans.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To get a general idea of where most people lie, the binwidth needs to be increased.

Most people have a monthly income in the 3000 - 6000 range. Monthly income should be highly correlated with the income range.
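Setting the binwidth explicitly might look like the sketch below. It assumes ggplot2; the income values are the first five shown by `str()` above plus the maximum from the summary, and the binwidth of 1000 is an arbitrary choice for illustration, not the value the author used.

```r
library(ggplot2)

# First few StatedMonthlyIncome values from str() above, plus the maximum
# from the summary to stand in for an extreme outlier.
incomes <- data.frame(
  StatedMonthlyIncome = c(3083, 6125, 2083, 2875, 9583, 1750003)
)

# Keep incomes below 10000 so the outlier does not stretch the x axis.
typical <- subset(incomes, StatedMonthlyIncome < 10000)

# An explicit binwidth replaces the default of 30 bins and avoids the
# "Pick better value with `binwidth`" message.
p <- ggplot(typical, aes(x = StatedMonthlyIncome)) +
  geom_histogram(binwidth = 1000) +
  labs(x = "Stated monthly income", y = "Number of loans")

print(p)
```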

## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:memisc':
## 
##     is.interval
## 
## The following object is masked from 'package:base':
## 
##     date

The highest number of loans were closed in 2014, while from 2010 to 2013 the number remained roughly constant.
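The closing year can be derived from `ClosedDate` with lubridate, roughly as sketched here. The variable name `closedYear` is an assumption (the report does not show the name the author used), and the sample dates are taken from the `str()` and `summary()` output above.

```r
library(lubridate)

# Sample ClosedDate values in the format shown above; the empty string
# marks a loan that has not been closed yet.
closedDate <- c("2005-11-25 00:00:00", "", "2014-03-04 00:00:00")

# ymd_hms() parses the timestamps; quiet = TRUE suppresses the warning
# for the empty string, which is parsed as NA. year() extracts the year.
closedYear <- year(ymd_hms(closedDate, quiet = TRUE))

# Count loans per closing year, keeping still-open loans as NA.
table(closedYear, useNA = "ifany")
```

The same pattern would apply to `LoanOriginationDate` when a loan start year is needed.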

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  385 1351 2467 3553 4804 6367 7449 8945 8985 8731 8152 7500 6530 5677 4927 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
## 3985 3468 2619 2242 1730 1377 1068  828  670  563  446  348  251  205  145 
##   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44 
##  119   91   75   62   39   40   34   23   23   13   10    8    3    1    4 
##   45   46   47   48   51   52   54   56   59 
##    3    1    3    3    1    3    3    2    1

From the plot it seems that most people have around 7 - 12 credit lines, with some having as many as 59 open credit lines.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8554 rows containing non-finite values (stat_bin).

There are large outliers in the plot that need to be cleaned up, and the binwidth should be reduced to get a better plot.

For the majority, the debt-to-income ratio lies around 0.25.

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

The main features of interest in this dataset are Loan Status, Prosper Score and Credit Score, and the relations between them. I expect the probability of a borrower defaulting on a loan to be directly correlated with the borrower's Prosper score and credit score.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features of interest are the income range and the employment status. It is possible that someone with a good credit score and a good Prosper score has been unemployed for a while and is about to default on a loan.

Did you create any new variables from existing variables in the dataset?

Yes, I added a closed-date year. This helps me visualize in which year most of the loans were closed; I can then subset the data to see how many of these closed loans were defaulted, cancelled or charged off. I predict that during the recession, around 2008 - 2010, the ratio of completed loans was lower than in later years.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were some outliers in the credit score range that I had to clean up to get a good view of the income ranges. Because of the large number of occupations, I needed to flip the axes to get a readable plot. For the listing category plot, my initial guess was that auto loans or home loans would be the most common, but the plot shows that most loans were for debt consolidation, which was quite surprising. I went through some articles on the subject, which state that debt consolidation loans have a higher probability of default than other loans, and still this category has a very high count.

Bivariate Plots Section

First I want to compare credit scores with monthly income, and see if there is any correlation between them or not.

## Warning: Removed 591 rows containing missing values (geom_point).

This plot does not give a correct picture because of various outliers. I should probably consider only monthly incomes below 10000 and also remove rows with a credit score of 0.

There are too many points. To see where the scatter plot is densest, I need to add alpha blending.

I can see a relation that seems linear; adding a smoothed line should show more clearly where the trend is heading.

Now we can see a smooth line with a clear linear trend. The correlation between the two variables is positive, at 0.22. This seems low to me; logically these two variables should be much more strongly correlated.
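The cleaning, alpha blending, smoothing and correlation steps described above could be combined as in the following sketch. It assumes ggplot2 and runs on synthetic data generated purely for illustration, so the resulting correlation is not the 0.22 reported for the real data. The linear smoother (`method = "lm"`) and the alpha value are also assumptions, since the report does not say which settings the author used.

```r
library(ggplot2)

set.seed(1)

# Synthetic data for illustration only (NOT the Prosper data):
# income rises weakly with credit score, plus noise.
n     <- 500
score <- sample(seq(600, 800, by = 20), n, replace = TRUE)
toy   <- data.frame(
  CreditScoreRangeLower = score,
  StatedMonthlyIncome   = 2000 + 10 * (score - 600) + rnorm(n, sd = 1500)
)

# Add two artificial outliers: a zero credit score and an extreme income.
toy <- rbind(toy, data.frame(CreditScoreRangeLower = c(0, 700),
                             StatedMonthlyIncome   = c(3000, 1750003)))

# Remove zero credit scores and monthly incomes of 10000 or more.
cleaned <- subset(toy, CreditScoreRangeLower > 0 & StatedMonthlyIncome < 10000)

# alpha < 1 makes dense regions darker; geom_smooth() adds a trend line.
p <- ggplot(cleaned, aes(x = CreditScoreRangeLower, y = StatedMonthlyIncome)) +
  geom_point(alpha = 1 / 10) +
  geom_smooth(method = "lm", formula = y ~ x)

print(p)

# Pearson correlation on the cleaned rows, skipping any missing values.
r <- cor(cleaned$CreditScoreRangeLower, cleaned$StatedMonthlyIncome,
         use = "complete.obs")
r
```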

Next we can compare Prosper Score with the Credit Score.

The plot shows similar characteristics to the Credit Score vs Monthly Income comparison: there are many points, and only after smoothing can we see a mostly linear relation between the two variables. At the upper end, though, even people with higher credit scores had lower Prosper scores. The correlation between the two is 0.37.

Now I want to compare Loan Status with Credit Score. I will set the binwidth to 20 to group credit scores together for a clearer bar graph, and remove loans whose LoanStatus is Current.

## LoanStatus: Cancelled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     500     515     580     595     660     720       1 
## -------------------------------------------------------- 
## LoanStatus: Chargedoff
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   600.0   660.0   648.9   700.0   860.0      48 
## -------------------------------------------------------- 
## LoanStatus: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   640.0   680.0   685.6   740.0   880.0     416 
## -------------------------------------------------------- 
## LoanStatus: Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   700.0   698.7   720.0   880.0 
## -------------------------------------------------------- 
## LoanStatus: Defaulted
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   560.0   640.0   620.9   680.0   860.0     126 
## -------------------------------------------------------- 
## LoanStatus: FinalPaymentInProgress
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   700.0   700.4   740.0   820.0 
## -------------------------------------------------------- 
## LoanStatus: Past Due (>120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   640.0   675.0   680.0   687.5   700.0   780.0 
## -------------------------------------------------------- 
## LoanStatus: Past Due (1-15 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   680.0   687.6   720.0   860.0 
## -------------------------------------------------------- 
## LoanStatus: Past Due (16-30 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   680.0   682.2   720.0   820.0 
## -------------------------------------------------------- 
## LoanStatus: Past Due (31-60 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   680.0   691.5   720.0   820.0 
## -------------------------------------------------------- 
## LoanStatus: Past Due (61-90 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   680.0   688.8   720.0   820.0 
## -------------------------------------------------------- 
## LoanStatus: Past Due (91-120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   680.0   690.2   720.0   820.0

At lower credit scores there are more charged-off and defaulted loans compared to current and completed ones. At higher scores the share of completed and current loans is higher.
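Per-status summaries like the ones printed above have the shape of output from base R's `by()`. A minimal sketch follows, assuming a small toy data frame: the first four rows come from the `str()` output above, while the two Defaulted rows are made up for illustration.

```r
# Toy data: rows 1-4 follow the str() output above; the two Defaulted
# rows are hypothetical and only there to give a third group.
loans <- data.frame(
  LoanStatus            = c("Completed", "Current", "Completed",
                            "Current", "Defaulted", "Defaulted"),
  CreditScoreRangeLower = c(640, 680, 480, 800, 560, NA)
)

# Six-number summary of the credit score within each loan status.
by(loans$CreditScoreRangeLower, loans$LoanStatus, summary)

# A more compact alternative: one median per status, ignoring NAs.
medians <- tapply(loans$CreditScoreRangeLower, loans$LoanStatus,
                  median, na.rm = TRUE)
medians
```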

Next I compare LoanStatus with the year to see whether there is any relation between them. In particular, I want to know whether the recession after 2008 increased the number of defaulted or charged-off loans.

The number of defaulted and charged-off loans rose suddenly after 2007 and stayed high until 2010; after that the ratio is much lower.

I will compare states and Loan Status and see if there is some relation we can find here.

The distribution looks unremarkable here; the ratio seems to be about the same for each state.

I will now compare Listing category with the Loan Status.

Many loans are in the debt consolidation category, which, along with the unknown category, also has the highest numbers of defaulted and charged-off loans. However, I could not find much of a difference between categories, because the ratio is almost the same.

Next I want to look at the correlation between credit score and number of credit lines.

## loanData$CreditScoreRangeLower: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     133 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 360
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA       1 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 420
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA       5 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 440
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA      36 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 460
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     141 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 480
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     346 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 500
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     554 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 520
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.00    4.00    5.01    7.00   30.00     442 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 540
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   6.000   6.851   9.000  31.000     736 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 560
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   4.000   7.500   8.484  12.000  29.000     359 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 580
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.000   8.000   9.236  13.000  41.000     281 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 600
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00    9.00    9.49   13.00   45.00     604 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 620
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.000   9.000   9.798  13.000  52.000     537 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 640
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.000   8.000   9.102  12.000  48.000     621 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 660
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.000   9.000   9.697  12.000  54.000     398 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 680
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.57   13.00   54.00     425 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 700
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.97   14.00   59.00     283 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 720
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    7.00   10.00   11.09   14.00   56.00     312 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 740
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.99   14.00   51.00     217 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 760
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    7.00   10.00   10.98   14.00   44.00     180 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 780
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    7.00   10.00   11.07   14.00   43.00     151 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    7.00   10.00   10.93   14.00   38.00     115 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 820
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    7.00   10.00   10.96   13.00   39.00      72 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 840
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00    7.00   10.00   10.61   13.00   32.00      34 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 860
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   7.000   9.000   9.838  12.000  26.000      27 
## -------------------------------------------------------- 
## loanData$CreditScoreRangeLower: 880
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    3.00    7.00    8.50    8.75   10.00   18.00       3

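People with the most credit lines mostly sit around the 700 credit score range, and people with higher credit scores, roughly above 750, tend to have fewer credit lines: the maximum falls from 56 at a score of 720 to 18 at 880, although the median stays at 10 until the score reaches 860.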
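
The per-score summaries above are the kind of output `by()` produces. A minimal sketch, assuming a data frame with `CurrentCreditLines` and `CreditScoreRangeLower` columns; the `toy` data frame and its values are invented for illustration and are not the loan data:

```r
# Toy stand-in for loanData; the values are invented for illustration only.
toy <- data.frame(
  CreditScoreRangeLower = c(720, 720, 720, 880, 880, 880),
  CurrentCreditLines    = c(7, 14, NA, 3, 8, 10)
)

# summary() of credit lines within each credit score bucket.
lines_by_score <- by(toy$CurrentCreditLines, toy$CreditScoreRangeLower, summary)
print(lines_by_score)

# Median per bucket, ignoring missing values.
median_by_score <- tapply(toy$CurrentCreditLines, toy$CreditScoreRangeLower,
                          median, na.rm = TRUE)
```

Reading the medians and maxima side by side makes it easier to see whether the whole distribution shifts or only its upper tail.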

Next, I compare DebtToIncomeRatio with LoanStatus.

Plotting all statuses on a single graph shows nothing useful, so I used a facet wrap to see how the curve changes within each category. Comparing the Defaulted and Completed curves, the Completed curve rises steeply up to about 0.20 and then falls just as steeply, whereas the Defaulted and Chargedoff curves are flatter. I read this as people with lower debt tending to default or have their loans charged off.
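
One way to put a number on how steep each faceted curve is would be to compare the spread of DebtToIncomeRatio per loan status. A minimal sketch on invented data (the `toy` data frame is not the loan data), using the interquartile range as an assumed stand-in for steepness; a narrow IQR corresponds to a sharp peak:

```r
# Toy data: a tightly peaked group and a flatter group (invented values).
toy <- data.frame(
  LoanStatus = rep(c("Completed", "Defaulted"), each = 6),
  DebtToIncomeRatio = c(0.18, 0.19, 0.20, 0.20, 0.21, 0.22,
                        0.05, 0.15, 0.25, 0.35, 0.50, 0.70)
)

# Interquartile range of the ratio within each loan status.
spread_by_status <- aggregate(DebtToIncomeRatio ~ LoanStatus, data = toy, FUN = IQR)
print(spread_by_status)
```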

Comparing Employment Status and Loan Status.

For the Employed and Full-time employment statuses, loans were mostly charged off rather than defaulted.

Next, I want to see the status of loans by the year they started. For this I need to create a LoanYear variable that gives the year in which each loan originated.
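
A minimal sketch of deriving such a variable, assuming LoanOriginationDate is stored as a factor of timestamps in the same format as ClosedDate ("2005-11-25 00:00:00"); the `toy` data frame and its dates are invented for illustration:

```r
# Toy stand-in for loanData; dates and statuses are invented.
toy <- data.frame(
  LoanOriginationDate = factor(c("2007-08-26 00:00:00",
                                 "2009-03-02 00:00:00",
                                 "2014-01-15 00:00:00")),
  LoanStatus = c("Completed", "Chargedoff", "Current")
)

# Factor -> character -> Date (the time part is ignored), then keep the year.
origination <- as.Date(as.character(toy$LoanOriginationDate), format = "%Y-%m-%d")
toy$LoanYear <- as.integer(format(origination, "%Y"))

# Count loans per origination year and status.
status_by_year <- table(toy$LoanYear, toy$LoanStatus)
print(status_by_year)
```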

Most of the defaulted and charged-off loans originated in 2006 to 2008, and the number drops sharply in later years. There is also a major drop in loans initiated in 2009, which might be because the economy was still recovering from the recession.

Next, I want to compare Credit Lines with Monthly Debt and analyse their correlation. To get monthly debt, I need to multiply the debt-to-income ratio by the monthly income variable.
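
A minimal sketch of that derived variable on invented data. The column name `MonthlyDebt` is an assumption for illustration, and `toy` is not the loan data:

```r
# Toy stand-in for loanData; the values are invented.
toy <- data.frame(
  StatedMonthlyIncome = c(3000, 5000, 8000, NA),
  DebtToIncomeRatio   = c(0.20, 0.10, NA, 0.30),
  CurrentCreditLines  = c(5, 9, 12, 7)
)

# Monthly debt = debt-to-income ratio * stated monthly income.
# A missing value in either input makes the product NA.
toy$MonthlyDebt <- toy$DebtToIncomeRatio * toy$StatedMonthlyIncome
print(toy)
```

Rows where either input is missing end up with an NA monthly debt, which is the kind of row a scatter plot drops with a "missing values" warning.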

## Warning: Removed 16034 rows containing missing values (geom_point).

Some outliers with much higher monthly debt need to be removed. I limit monthly debt to less than 8000, as most of the observations are within that range, and also keep only credit lines below 40.

There is a fairly strong correlation between these two variables, around 0.596, so people with higher debt generally tend to have more credit lines open.
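
The trimming and the correlation can be sketched as follows, using the cutoffs mentioned above (monthly debt below 8000, credit lines below 40). The `toy` data frame and the `MonthlyDebt` column name are assumptions for illustration, so the resulting coefficient is not the one reported for the loan data:

```r
# Toy stand-in for loanData; the values are invented, with one obvious outlier.
toy <- data.frame(
  MonthlyDebt        = c(200, 450, 700, 1200, 2500, 15000),
  CurrentCreditLines = c(3, 6, 8, 12, 20, 45)
)

# Keep only observations inside the chosen ranges.
trimmed <- subset(toy, MonthlyDebt < 8000 & CurrentCreditLines < 40)

# Pearson correlation, skipping rows with missing values.
r <- cor(trimmed$MonthlyDebt, trimmed$CurrentCreditLines, use = "complete.obs")
print(r)
```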

The correlation between Borrower Rate and Credit Score is negative (-0.488), which seems logical, as people with higher credit scores get lower loan rates.

Borrower Rate and Loan Amount have a negative correlation of -0.33, which is the opposite of what I expected: larger amounts should carry higher rates because of the higher risk involved.

There is a positive correlation of 0.35 between Credit Score and Loan Amount: people with higher credit scores took larger loans.

Many more loans were given at an interest rate of about 0.325 than at the neighbouring rates.

Loan amounts decreased in 2009, have been increasing since then, and are considerably higher in 2013 and 2014 than in other years.

Higher income groups took out larger loans, which seems intuitive.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • Strong positive correlation between Monthly Debt and Credit Lines opened for an individual.
  • Positive correlation between monthly income and credit score which is logical.
  • Positive correlation between prosper score and credit score.
  • More defaulted or charged-off loans at lower credit scores.
  • A high number of people defaulted in 2008 to 2010, on loans taken out in 2006 to 2008.
  • Negative correlation between borrower rate and credit score, which is logical.
  • Negative correlation between borrower rate and loan amount.
  • Positive correlation between Loan amount and Credit Score.
  • Higher income groups took larger loans.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Loan amounts decreased in 2009 and have been going up since then; people are taking larger loans in 2013 and 2014 than before. The dip in 2009 is most likely an effect of the recession of 2007 to 2009.
  • Another interesting finding is the relation between Prosper Score and Credit Score: these two should be highly correlated, but the correlation was only 0.22, which seems very low.

What was the strongest relationship you found?

The strongest relationship was between Monthly Debt and the number of Current Credit Lines for an individual.

Multivariate Plots Section

Borrowers with higher credit scores are given larger loan amounts and lower borrower rates. For a clearer visualization, I restricted the data to credit scores greater than 680.

In the $100,000+, $25k-50k, $50k-75k and $75k-99k income ranges we can see a definite increase in loan amounts, but loan amounts drop for borrowers who are not employed, and the $1-24k range shows no increase in later years.

People with a higher debt-to-income ratio and higher credit scores have lower Prosper scores, while people with higher credit scores but a lower debt-to-income ratio have higher Prosper scores. This helps explain the low correlation between Prosper score and credit score.
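
The same three-way relationship can be checked numerically by averaging Prosper score within credit score and debt-to-income bands. A minimal sketch on invented data; the `toy` data frame, the `DtiBand` column, and the 0.25 cutoff are assumptions for illustration, not values taken from the analysis:

```r
# Toy stand-in for loanData; the values are invented.
toy <- data.frame(
  CreditScoreRangeLower = c(700, 700, 760, 760, 820, 820),
  DebtToIncomeRatio     = c(0.10, 0.45, 0.15, 0.50, 0.12, 0.40),
  ProsperScore          = c(7, 4, 8, 5, 10, 6)
)

# Split debt-to-income ratio into two bands at an arbitrary cutoff of 0.25.
toy$DtiBand <- cut(toy$DebtToIncomeRatio,
                   breaks = c(0, 0.25, Inf),
                   labels = c("low", "high"))

# Mean Prosper score for each credit score / debt-to-income band combination.
mean_score <- tapply(toy$ProsperScore,
                     list(toy$CreditScoreRangeLower, toy$DtiBand),
                     mean)
print(mean_score)
```

If the "high" column is consistently below the "low" column within each credit score row, debt-to-income ratio is pulling Prosper score down independently of credit score.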

Monthly debt and current credit lines are highly correlated. After adding credit score ranges, people with lower credit scores have lower monthly debt; further up the plot the results are mixed, but people with higher credit scores and more credit lines do have higher monthly debt as well.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Looking at loan amounts and borrower rates, people with higher credit scores were clearly given lower interest rates even for larger loan amounts. Another relationship also emerged: the higher the income, the higher the loan amount.

Were there any interesting or surprising interactions between features?

To look further into the low correlation between Prosper score and credit score, I added debt-to-income ratio to the plot. This explained much of the relationship between these two variables: a higher debt-to-income ratio lowers the Prosper score even for people with higher credit scores.

Final Plots and Summary

Plot One

Description One

Year by year there have been changes. In 2006 most people were given low rates, and in 2007 rates increased a little. In 2009 people were taking lower loan amounts but still getting higher rates, and the same holds for 2010. Rates started falling after 2011, and loan amounts started increasing in 2011, 2012 and 2013. In 2014 rates have dropped and amounts have increased quite a lot. Before 2014 there is also a clear demarcation in the rates given to people with higher credit scores, but that demarcation has vanished in 2014.

Plot Two

Description Two

Loan amounts increased in 2013 and 2014, yet debt-to-income ratios are much lower. In 2007 and 2008, many people have worse debt-to-income ratios, closer to 1. In 2009 and 2010 most people maintained a good debt-to-income ratio for their loans, but it gets worse again in 2011 and 2012. I can also deduce that people in the low income range $1-24,999 have poor debt-to-income ratios, and that loan amounts rise as we go up the income ranges.

Plot Three

Description Three

People in the $1-24,999 and $25,000-49,999 income ranges have higher debt-to-income ratios and therefore mostly lower Prosper scores. In the income ranges above 50k, the clusters get darker towards the right of the plot, and people also have lower debt-to-income ratios.

Reflection

The loan data set had 113,937 loan observations for the years 2006 to 2014 with 81 variables; for this problem set I chose 22 variables for analysis. The first difficulty was choosing a correct and smaller dataset for my work. I wanted to keep the dataset small, so I went in with 15 variables at first; then, while working through the analysis, I went through more variables and found that some might provide better analysis or correlate better with other variables in the set. I would have liked to build a model around the data, but I think I would need to take the next courses to implement that. Across multiple plots I could see that the company was struggling with loans, and that the situation was worse during the recession years, when they were probably too lenient in giving out loans even to people with bad debt-to-income ratios. I think they have recovered from that since 2012, where I can see they have applied some strictness and are giving good loans only.